Executive Overview

This report covers the period from April to October 1983 on contract No.
NOOU'jy-C-0107. A few of the highlights of this report follow.

In our architecture research, a 41,000 transistor second generation 32 bit
processor, which demonstrates the advantages of the reduced instruction set
concept, has been successfully fabricated and tested. In addition, a 46,000
transistor instruction cache was found to be fully operational; it
incorporates a new redundancy idea that tripled the yield. A new CAD tool, a
timing verifier called Crystal, proved extremely useful because it detected
potential performance mistakes during the design process.

In the computer aids for design and layout research, 85 copies of the new
tools tape have been distributed to university and industrial labs in the U.S.
This tape contains about 25 programs, including several new programs (Lyra,
Crystal, Peg, and Tpack). A new, improved version of the Waveform-Relaxation
based simulator has been developed which uses new techniques to speed up
convergence and to control error. A new algorithm called BLOSSOM has been
developed, which can exploit the parallelism of VLSI for the solution of
large-scale linear systems of algebraic equations.

The circuitry contained on one Multibus card has been developed to
recognize up to 1000 words of speech in real time. This board will deliver the
equivalent of 118 MIPS of von Neumann operations. The algorithm uses dynamic
time warping, which is the most successful technique at this time. The board
has sufficient flexibility to allow considerable algorithmic enhancements,
which should make it extremely useful for further developments in speech
recognition research. The board uses two custom circuits, one of which was
designed fully automatically from a high level description, requiring no
layout from the algorithm developer.
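
The following is a minimal sketch of word recognition by dynamic time warping.
It assumes one scalar feature per frame and an absolute-difference local cost.
The board's actual features, distance measure, and path constraints are not
given in this report, and the names dtw_distance and recognize are illustrative
only.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the vocabulary word whose stored template warps most cheaply."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical two-word vocabulary; real templates would be feature vectors.
templates = {"yes": [1, 2, 3], "no": [3, 2, 1]}
best = recognize([1, 2, 2, 3], templates)
```

The utterance repeats one frame of the "yes" template, and the warping path
absorbs the repetition at zero cost, which is why the method tolerates
variations in speaking rate.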

Models have been developed which explain the operation of transistors
which have been scaled to near the ultimate limits in oxide thickness. The
reliability of these devices has been investigated, as well as performance
degradation from hot and tunneling electrons.
The RISC II processor and the instruction cache chip have been received
from fabrication and have been tested.

RISC II [1], a 32-bit NMOS microprocessor using 41,000 transistors, is an
improved version of the RISC I chip. It is 25% smaller than its predecessor even
though it has 75% more registers using the same design rules. The sharing of
the bit lines for reading and writing, which made this size reduction possible,
required, however, an extra pipe stage plus operand forwarding circuits.

Like RISC I, RISC II worked on the first silicon. This time, however, the
performance was close to what we predicted, in part because of careful design
and extensive Spice simulation of critical data-path delays, and in part
because of Crystal, a timing verifier developed by John Ousterhout. Crystal was
used to find the time-critical paths and to verify that the less regular parts
of the circuitry (e.g., control) were matched in speed to the highly optimized
data path. The predicted RISC II cycle time (i.e., execution of a
register-to-register instruction) was 480 nsec. In the lab, RISC II chips run
at 500 ns per instruction (VDD=5V, VBB=VSS=0V, room temperature). The average
power consumption is 1.25 Watts, a little less than we had anticipated.

Benchmark simulations showed that an 800 nsec RISC II runs integer C
programs faster than an 8-MHz iAPX-286, 10-MHz NS 16032, 12-MHz 68000, or
18-MHz HP 9000 CPU.

We also tested the RISC II instruction cache [2], a 46,500 transistor NMOS
chip (the largest chip we have built so far). The instruction cache also worked
on first silicon. Again, Ousterhout's Crystal program was used to verify the
timing of the cache, and again its usefulness was proven. It uncovered one
performance mistake (the ratio on one gate was 4/1 instead of 1/4) which would
have stretched a 70 ns clock phase to 700 ns. When testing the corrected
design, the fastest chip was found to have a 250 ns access time with a 480 ns
cycle time (VDD=5V, VBB=VSS=0V, room temperature), comparing favorably
with the projected 400 ns time. The instruction cache is thus compatible with
the 500 ns RISC II CPU. The average power consumption was 1 Watt, less than we
expected.

There were several new ideas in the cache chip, including one to improve
yield. Caches already have a bit per cache block, used on power up, that
indicates invalid data. We added an extra invalid bit to put blocks into a
permanent "invalid" state if they were found defective. We tested 13 chips: the
new idea tripled the yield.
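
The redundancy idea can be modeled in a few lines. The sketch below assumes a
direct-mapped organization and a word-per-block cache purely for illustration;
the organization of the actual chip is not described here, and the class and
attribute names are hypothetical.

```python
class InstructionCache:
    """Toy cache model with two status bits per block."""

    def __init__(self, num_blocks, defective_blocks=()):
        self.num_blocks = num_blocks
        self.tags = [None] * num_blocks
        self.data = [None] * num_blocks
        # Ordinary valid bit: cleared at power up, set when a block is filled.
        self.valid = [False] * num_blocks
        # Extra bit: set once after testing for blocks found defective.
        bad = set(defective_blocks)
        self.disabled = [i in bad for i in range(num_blocks)]

    def _split(self, address):
        return address % self.num_blocks, address // self.num_blocks

    def lookup(self, address):
        """Return the cached word, or None on a miss."""
        index, tag = self._split(address)
        if self.disabled[index] or not self.valid[index]:
            return None  # a defective block always misses
        return self.data[index] if self.tags[index] == tag else None

    def fill(self, address, word):
        index, tag = self._split(address)
        if self.disabled[index]:
            return  # never store into a defective block
        self.tags[index], self.data[index] = tag, word
        self.valid[index] = True


cache = InstructionCache(num_blocks=4, defective_blocks=[2])
cache.fill(1, "add")
cache.fill(2, "sub")  # maps to the defective block and is silently dropped
```

A chip with a defective block still returns correct results, because every
reference to that block misses and is served from memory. The cost is a lower
hit rate, not a discarded chip.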

In the last 8 months several papers sponsored by DARPA have been published
[3] [4] [5] [8]. We also saw the announcement of the Pyramid minicomputer,
the first commercial computer to use ideas we developed at Berkeley.
1.2. VLSI Communication Components For Multiprocessors (C. Sequin)


We have studied an approach to supercomputers that is particularly suited
for implementation with VLSI technology. It relies on a large pool of single-chip
computers that operate concurrently on many subtasks of a complex problem.
Multiprocessors, in the past, have often fallen short of the expected
performance because of a lack of interprocessor communication bandwidth. As the
number of computers in the system is increased from a few tens to several
hundreds, communication will become even more of a bottleneck unless the system
is planned in such a manner that the available total bandwidth grows
automatically with the number of processors in the system. This research
produces guidelines for the construction of VLSI communication components for
this framework.

The approach studied in detail concerns a "communications domain" of
arbitrary topology created from VLSI components linked by high-speed
dedicated links. It is shown that the power limitations of the individual chips
make it advantageous to concentrate all available output bandwidth into a few
ports with maximum bandwidth. These components would also have built-in
facilities for low-level functions such as handshaking, buffering, and flow
control, so that the information packet or the block of data to be transmitted
is the lowest primitive the systems builder needs to be concerned about. The
design of such modular VLSI communications components is being investigated,
cost estimates in terms of the number of devices needed for their
implementation are being performed, and the trade-offs in the various design
parameters are being analyzed.


1.3. Novel, High-Performance Architectures (S. Baden, A. Despain)

The first phase of our investigation into the use of functional languages for
array processing culminated in a presentation at COMPCON '83 [3]. At COMPCON
we reported on our experiences with our own implementation of Backus'
functional language, FP [1]. (Berkeley FP is now available on the latest
release of Berkeley UNIX™ [2]). We found that while FP's Program Forming
Operations (PFO's) were a powerful facility for expressing concurrency, their
generality was severely diminished owing to the lack of user-definable PFO's.
As a result, user-defined program building blocks could not be re-used for
applications unrelated to their original purpose. FP programs tend to be data
structure intensive owing to a lack of data abstraction facilities, i.e., the
programmer must pay undue attention to the low-level representation details.
Since FP's semantics are strict, all intermediate results must be fully
evaluated before being passed on to their enclosing expressions. This is
unfortunate, as sometimes it is convenient to defer a complete evaluation:
either to allow for infinite streams of data, e.g., I/O (see for example
Henderson and Morris' work on lazy evaluation [9]), to admit infinite
expressions into the language [13] (a powerful programming technique), or to
exploit some well-known optimization techniques (see Guibas and Wyatt's
treatment of delayed evaluation in APL [8]). Finally, we found that FP's
ambiguous treatment of errors coupled with its strict semantics made exception
handling impossible and debugging difficult at best.

We have developed our own functional language, called XBA, to provide the
abstraction and error handling facilities that a functional language must have
to be practical. XBA has non-strict semantics, lending itself to a lazy
implementation. The language is based in part on Turner's functional language
KRC [13]. All information is treated uniformly in XBA; hence, the programmer
may define his own PFO's (they are just functions of functions) and error
values may be manipulated as, say, numbers would be. A simple data typing
facility is provided: Pascal-like structural type definitions and optional type
declarations (XBA is not strongly typed but declarations help the compiler
generate faster code). In addition, PROLOG-like argument pattern matching is
provided.

We have found that XBA is better suited to writing concurrent software than
FP owing to its abstraction facilities and to its uniform treatment of
information. As a result, machine independent code is easier to write and
software is generally reusable. In addition, the adoption of non-strict
semantics lends itself to a natural treatment of I/O and to optimizations that
avoid temporary storage, unnecessary computations, or both.
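
The benefit of non-strict evaluation can be illustrated with Python
generators, which compute elements only on demand. This is an analogy only:
XBA syntax is not shown in this report, and the function names below are
hypothetical.

```python
from itertools import islice


def naturals(start=0):
    """An infinite stream; a strict language could never finish building it."""
    n = start
    while True:
        yield n
        n += 1


def squares(stream):
    """Transform a stream element by element, with no intermediate array."""
    return (x * x for x in stream)


# Only the five demanded elements are ever computed.
first_five = list(islice(squares(naturals()), 5))
```

Under strict semantics the argument naturals() would have to be fully
evaluated before squares could start, which is impossible for an unbounded
input such as an I/O stream. The lazy version also stores no temporary list
between the two stages.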

As we have compiled XBA we have investigated applicative instruction sets
based on the work of Turner and others [12,4]. Turner has reported on a
combinatory representation for KRC programs, i.e., one that contains no bound
variables. Besides constituting the first truly applicative instruction set,
the representation also brings into being the notion of compile time in the
implementation of functional languages (these languages tend to be executed
interpretively). The combinatory representation blurs the distinction between
compile time and run time; indeed, some function calls can be executed at
compile time to reduce the number of run time procedure calls (traditionally,
this is done with macros). Also, combinatory code is dynamically
self-improving, i.e., on-the-fly code generation is unnecessary. We conducted
an extensive (and as yet unpublished) survey on the topic, concluding that,
like data-flow implementations, combinatory ones impose a prohibitive overhead
on task partitioning and scheduling, particularly on the sorts of array
computations in which we are interested. Although some of this overhead can be
removed using newly published data flow analysis techniques [10], it will be
necessary to provide array facilities as primitives if such an implementation
strategy is to be at all feasible. That brings us to our present research
effort.
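
A combinatory representation can be sketched with the classic S, K, and I
combinators. The code below uses the textbook bracket-abstraction rules, not
the optimized translation of Turner or Burton, and the term encoding (strings
for atoms, pairs for application) is an assumption made for illustration.
Variables must not be named S, K, or I.

```python
def occurs(var, term):
    """True if variable var appears anywhere in term."""
    if isinstance(term, str):
        return term == var
    return any(occurs(var, part) for part in term)


def abstract(var, term):
    """Return a term T free of var such that (T var) reduces to term."""
    if term == var:
        return "I"
    if not occurs(var, term):
        return ("K", term)
    func, arg = term  # application containing var
    return (("S", abstract(var, func)), abstract(var, arg))


def apply_all(head, args):
    """Rebuild an application; args holds the first argument last."""
    for arg in reversed(args):
        head = (head, arg)
    return head


def normalize(term):
    """Normal-order reduction; loops forever on terms with no normal form."""
    head, args = term, []
    while isinstance(head, tuple):  # unwind the application spine
        head, arg = head
        args.append(arg)
    if head == "I" and len(args) >= 1:
        x = args.pop()
        return normalize(apply_all(x, args))
    if head == "K" and len(args) >= 2:
        x = args.pop()
        args.pop()  # discard the second argument
        return normalize(apply_all(x, args))
    if head == "S" and len(args) >= 3:
        f, g, x = args.pop(), args.pop(), args.pop()
        return normalize(apply_all(((f, x), (g, x)), args))
    return apply_all(head, [normalize(a) for a in args])


# swap = \x. \y. (y x), compiled to a term with no bound variables
swap = abstract("x", abstract("y", ("y", "x")))
result = normalize(((swap, "a"), "b"))
```

The compiled term contains only S, K, I, and application, so it can be
executed by a fixed set of rewrite rules with no environment or variable
lookup.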

Currently we are investigating the design and implementation of array
facilities for XBA. In this context XBA will be used as an abstract notation to
describe the semantics of the facilities, rather than as a concrete language.
This approach ensures that our results will be applicable to a wider variety of
implementation styles than if we chose a concrete syntax for the language
(e.g., the facilities could be implemented as ADA packages and a special
preprocessor).

The use of the proposed array facilities, coupled with the availability of
PFO's, encourages the programmer to favor stylized, e.g., patterned, array
accesses over random, low-level ones. This level of data access lends itself to
a simple, efficient implementation strategy for generating and manipulating
memory access patterns (see for example the work of Guibas and Wyatt on
compiling APL [8]). In the first case, computations may be partitioned amongst
replicated resources rapidly, and, in the second, data access patterns may be
tuned to take advantage of particular hardware structures (e.g., vector
registers in the Cray), obviating machine-dependent code. The memory system
design supports multidimensional parallel accesses (see Lawrie and Vora's work
on the prime memory system [11]) and localized address computations. In toto,
the combination of low-overhead task partitioning and decentralized address
computations makes applicative architectures (e.g., either reduction [5] or
dataflow [6]) much more attractive for array processing than they have been in
the past [7].
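
The reason a prime number of memory modules helps patterned array access can
be seen with a simple interleaving model. The sketch assumes that address a
lives in module a mod M, which is a simplification; the memory design under
study here is not specified in this report, and the function names are
illustrative.

```python
def banks_touched(start, stride, count, num_banks):
    """Memory modules hit by `count` accesses at a fixed stride."""
    return [(start + k * stride) % num_banks for k in range(count)]


def conflict_free(start, stride, count, num_banks):
    """True if every access in the pattern lands in a different module."""
    banks = banks_touched(start, stride, count, num_banks)
    return len(set(banks)) == len(banks)


# An 8 x 8 array stored row-major: rows have stride 1, columns stride 8,
# and the main diagonal stride 9.
column_on_8_banks = conflict_free(0, 8, 8, 8)    # every access hits module 0
column_on_7_banks = conflict_free(0, 8, 7, 7)    # 8 mod 7 = 1, all distinct
diagonal_on_7_banks = conflict_free(0, 9, 7, 7)  # 9 mod 7 = 2, all distinct
```

With a prime module count, any stride that is not a multiple of that prime
visits every module before repeating, so rows, columns, and diagonals can all
be fetched in parallel. With a power-of-two count, a column access serializes
on a single module.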

Our final goal is to provide a complete specification of the array facilities
along with the memory system design. The design of the array facilities will
take precedence over the memory system design, so simple assumptions will be
made concerning the design of the processor-memory interconnection network. The
final design will be evaluated through simulation to determine, for instance,
the size and number of memories. We are meeting regularly with mathematicians
and computational physicists to determine a reasonable set of benchmark
programs with which to drive those simulations.

In sum, we have determined the appropriate features required of a
practical functional language for array processing: concurrency, abstraction
facilities, error recovery mechanisms, and non-strict semantics. We have
developed a language that meets these requirements and an implementation
strategy to overcome the "von Neumann bottleneck."


(1) John Backus, "Can Programming be Liberated from the von Neumann
Style? A Functional Style and its Algebra of Programs," Comm. ACM 21, 8
(Aug. 1978), 613-641. 1977 ACM Turing Award lecture.

(2) Scott B. Baden, Berkeley FP User's Manual, Rev. 4.1, Univ. of California,
Berkeley, CA, Dec. 1982.

(3) Scott B. Baden and Dorab R. Patel, "Berkeley FP - Experiences with a
Functional Programming Language," Conf. Rec. COMPCON '83, San
Francisco, Mar. 1983.

(4) F. Warren Burton, "A Linear Space Translation of Functional
Programs to Turner Combinators," Info. Process. Letters 14, 6 (23 Jul.
1982), 201-204.

(5) John Darlington and Mike Reeve, "ALICE: A Multi-Processor Reduction
Machine for the Parallel Evaluation of Applicative Languages," Proc.
Conf. Functional Languages and Computer Architecture, Portsmouth,
NH, 1981, 65-76.

(6) Jack B. Dennis, Guang-Rong Gao and Kenneth W. Todd, "A Data Flow
Supercomputer," MIT Laboratory for Computer Science, Mar. 1982.

(7) D. D. Gajski, D. A. Padua and D. J. Kuck, "A Second Opinion on Data Flow
Machines and Languages," Computer, Feb. 1982, 58-69.

(8) Leo J. Guibas and Douglas K. Wyatt, "Compilation and Delayed Evaluation
in APL," Proc. 5th POPL, 1978.

(9) Peter Henderson and James Morris, "A Lazy Evaluator," Proc. 3rd POPL,
Jan. 1976, 95-103.

(10) Kent Karlsson, "An Outline of the Sky Reduction Machine," Report # 1981-
02-03, Dept. of Comput. Sci., Univ. of Goeteborg, Goeteborg, Sweden, 1982.

(11) D. Lawrie and C. Vora, "The Prime Memory System for Array Access,"
Proc. Intl. Conf. on Parallel Processing, Aug. 1980, 61-67.

(12) David A. Turner, "A New Implementation Technique for Applicative
Languages," Software - Practice and Experience 9, (1979), 31-49.

(13) D. A. Turner, "Recursion Equations as a Programming Language," in
Functional Programming and its Applications, J. Darlington, P. Henderson
and D. A. Turner (ed.), Cambridge Univ. Press, Cambridge, 1982.


1.4 Multiprocessor Circuit Simulation (D. G. Messerschmitt)


Code generators for LU decomposition in a SPICE circuit simulator running
on a multiprocessor architecture have been completed. These have been run on
a multiprocessor simulator, and their performance evaluated. As expected,
because of the I/O bound nature of this computation, the performance was
found to degrade significantly as the interprocessor communication delays
increase. Subsequent work has concentrated on improving the performance by
designing scheduling algorithms which take into account interprocessor
communication delays. Two types of heuristic algorithms have been developed:
local optimization and global optimization. While the latter take
significantly more computer time to execute, and may therefore not be practical
in a production circuit simulation environment, they show approximately double
the speedup of the local optimization algorithm and thus serve as a basis of
comparison. A Ph.D. thesis and a couple of papers describing this work are in
preparation, and will be available in the next quarter.
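
The effect of communication delay on a locally greedy schedule can be shown
with a toy model. The sketch below is not the scheduling algorithm developed
in this work, which is not described here. It is a generic
earliest-finish-time heuristic with a uniform, assumed delay for any result
that crosses processors, and all names are hypothetical.

```python
def schedule(tasks, deps, cost, num_procs, comm_delay):
    """Greedy placement: each task goes where it would finish earliest.

    tasks must be listed in dependency order; deps maps a task to the
    tasks whose results it needs. Returns task -> (processor, finish time).
    """
    proc_free = [0.0] * num_procs
    placed = {}
    for task in tasks:
        best = None
        for proc in range(num_procs):
            # Operands from another processor arrive comm_delay later.
            ready = max(
                [placed[d][1] + (0.0 if placed[d][0] == proc else comm_delay)
                 for d in deps.get(task, [])],
                default=0.0,
            )
            finish = max(ready, proc_free[proc]) + cost[task]
            if best is None or finish < best[1]:
                best = (proc, finish)
        placed[task] = best
        proc_free[best[0]] = best[1]
    return placed


def makespan(placed):
    return max(finish for _, finish in placed.values())


# Task c combines the results of two independent tasks a and b.
tasks = ["a", "b", "c"]
deps = {"c": ["a", "b"]}
cost = {"a": 1.0, "b": 1.0, "c": 1.0}

fast_links = makespan(schedule(tasks, deps, cost, 2, comm_delay=0.5))
slow_links = makespan(schedule(tasks, deps, cost, 2, comm_delay=5.0))
one_proc = makespan(schedule(tasks, deps, cost, 1, comm_delay=5.0))
```

With slow links the greedy two-processor schedule is worse than running
everything on one processor, because a and b are spread out before the cost
of merging them at c is visible. A global method that looks ahead to c would
keep both on one processor.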

An examination of digital filtering in a parallel computational environment
has yielded a significant and surprising result. The goal has been to find an
implementation approach for digital filters which would have the property that
an arbitrarily high sampling rate could be achieved with a fixed speed of
hardware by applying parallelism with only local interconnection. It is obvious
how this can be done for non-recursive filters, but the surprising fact is that
it can also be achieved for recursive filters, where the feedback destroys
speedup by pipelining. A Ph.D. thesis and paper are also in preparation on this
topic. Work is continuing on understanding how practical constraints (such as
pin I/O limitations) limit the sampling rate.

Processor interconnection topology and routing algorithms are continuing
to be pursued. Automatic generation of topology from traffic statistics has
been implemented using a clustering algorithm. Measures of interconnection
hardware complexity and speed performance are being developed in order to
evaluate alternate interconnection topologies. Routing algorithms which
efficiently exploit a given topology are also being investigated.

In  cooperation  with  Professors  Newton  and  Sangiovanni,  ways  of  obtaining  a 
hardware  multiprocessor  machine  for  experimental  work  are  being  pursued. 
This  is  important,  as  we  expect  that  not  all  practical  constraints  can  be 
uncovered  by  software  simulations. 
2.1. 1983 VLSI Tools Distribution (J. Ousterhout)

Our new tools tape went into distribution on April 1. Since then,
approximately 65 copies of the tape have been sent to university and industrial
labs in the U.S. The new tape contains about 25 programs, including several new
programs (Lyra, Crystal, Peg, and Tpack) as well as updated versions of older
programs such as Caesar, Cifplot, and Mextra.


2.2. Crystal, Version 2 (J. Ousterhout)

During the spring and summer of 1983, most of Crystal was re-written. The
algorithms in the new version are simpler, cleaner, faster, and also more
general than the version 1 algorithms (experience is a wonderful teacher!).
Whereas version 1 had built-in notions about transistor types and could only
handle nMOS, version 2 is table driven, so that users can define new transistor
types. Version 2 has already been used for both nMOS and CMOS designs, and is
slightly faster than version 1. In the summer of 1983, work was begun to
upgrade the transistor models to include second-order effects due to waveform
shape. An initial implementation has just been completed, but its accuracy has
not yet been tested.


2.3. Caddy - A New IC Layout System (J. Ousterhout)

We have undertaken the development of a new IC layout system called
Caddy. The system has three overall goals, based on problems experienced with
our earlier systems. The first goal is to integrate design rule and circuit
information into the layout editor in order to provide incremental design rule
checking and circuit extraction. This additional expertise will permit
interactive compaction and stretching of layouts. The second goal is to move
away from fabrication details by eliminating the need for designers to specify
implants, wells, and contact details explicitly. These layers will be generated
automatically by the system (the result is much like "sticks" except that it is
fleshed out). The third goal is to provide interactive semi-automatic routing
aids. In this respect, our goal is not to invent new algorithms and paradigms,
but to find powerful ways of embedding existing techniques into an interactive
design environment.

Initial discussions were held in the Spring and Fall of 1982, during which the
underlying data structures and algorithms ("corner stitching") were developed.
Between January and April of 1983 the basic structure of the system was
designed and implementation was begun. A bare-bones system with about the
functionality of Caesar (but with corner stitching as the underlying data
structure) became operational in early April 1983, and has been used since then
in the layout of the nMOS SOAR microprocessor. During the spring and summer,
design was completed for compaction, stretching, design-rule checking, routing,
and multiple windows. The window facilities are now in the final stage of
debugging; compaction, stretching, and design-rule checking are partially
implemented; and the routing implementation (based on an extended version of
Rivest's greedy channel router) is just beginning.


R.N. Mayo, J.K. Ousterhout, and W.S. Scott, eds: "1983 VLSI Tools," Technical
Report No. UCB/CSD 83/115, Computer Science Division, University of
California, Berkeley, 1983.

J.K. Ousterhout: "Corner Stitching: A Data Structuring Technique for VLSI
Layout Tools," IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, to appear.

J.K. Ousterhout: "The User Interface and Implementation of Caesar,"
Technical Report No. UCB/CSD 83/131, Computer Science Division, University
of California, Berkeley, 1983.
2.4 Finite-State Machine Synthesis (R. Newton, A. Sangiovanni-Vincentelli)

Sequential circuits play a major role in the control part of digital systems.
We addressed the automated synthesis of sequential logic functions in a
structured VLSI design methodology. We considered sequential logic functions
implemented by synchronous deterministic Finite State Machines (FSM)
consisting of two distinct components: a combinational circuit implemented by
a Programmable Logic Array (PLA) and a memory implemented by Delay-type
registers.

In particular we considered the problem of assigning binary codes to the
internal states of a Finite State Machine. The literature is rich in papers
dealing with the state-assignment problem. Here we refer to the major
approaches only. Armstrong introduced a set of criteria for encoding states,
aiming at the minimization of the number of gates used to implement the FSM,
and formulated the encoding problem as a graph embedding problem. Hartmanis,
Stearns and Karp developed algebraic methods based on partition theory and on a
reduced dependence criterion. Dolotta and McCluskey suggested a "column-based"
procedure to code states. Note that despite these efforts, to the best of our
knowledge no FSM design tool is in use today for time-effective state encoding
of industrial digital controllers.

Armstrong's approach can in principle handle rather large machines, but it
has three serious drawbacks. The first is that the criteria suggested by
Armstrong do not take into account the techniques of fast heuristic logic
minimizers such as MINI, PRESTO, or ESPRESSO-II in use today (Armstrong's
paper appeared before the work on heuristic minimizers started). The second is
that the state-assignment problem is transformed into a particular
graph-embedding problem, which represents the state coding problem only
partially. The third is that the graph embedding algorithm suggested by
Armstrong was ineffective.

Our approach is based, like Armstrong's, on the use of distance relations
among the codes of the internal states. We showed in [1] how the combinational
logic can be reduced by requiring state codes to satisfy appropriate distances.
Distance requirements are determined by predicting the effects of heuristic
minimization of the combinational logic related to a symbolic description of the
FSM, and are represented by a graph. In particular it is shown that a convenient
reduction of the combinational logic is obtained if the distance between some
state codes is large enough and appropriate states have adjacent codes.

We considered the problem of assigning codes which satisfy the distance
relations. Adjacent code assignment can be seen as an embedding of an
adjacency graph into a boolean hypercube. Armstrong and Saucier represented the
state assignment problem as a subgraph isomorphism problem, where a
one-to-one relation (coding) is sought between the set of the states (vertices
of the adjacency graph) and a subset of the boolean hypercube vertices (codes).
Note that even deciding the existence of a subgraph isomorphism is a hard
problem: in particular it was shown to belong to the class of NP-complete
problems. Since such an isomorphism may not exist, Armstrong and Saucier
relaxed some adjacency requirements and proposed heuristic techniques to embed
a subgraph of the adjacency graph into the boolean hypercube. Note that a
distance-preserving embedding is not even guaranteed by augmenting the
dimensions of the hypercube, i.e., increasing the length of the state codes.
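
The embedding view can be made concrete for very small machines. The sketch
below searches exhaustively for a code assignment in which every pair of
states joined by an adjacency edge receives codes at Hamming distance 1. The
exhaustive search and the function names are illustrative assumptions and are
not the heuristics discussed above.

```python
from itertools import permutations


def hamming(a, b):
    """Number of bit positions in which two integer codes differ."""
    return bin(a ^ b).count("1")


def satisfies_adjacency(codes, edges):
    """True if every required pair of states has adjacent codes."""
    return all(hamming(codes[u], codes[v]) == 1 for u, v in edges)


def find_embedding(states, edges, num_bits):
    """Try every one-to-one assignment of num_bits-bit codes (tiny cases only)."""
    for perm in permutations(range(2 ** num_bits), len(states)):
        codes = dict(zip(states, perm))
        if satisfies_adjacency(codes, edges):
            return codes
    return None


# A 4-cycle of adjacency requirements fits a 2-bit code exactly.
square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
square_codes = find_embedding(["a", "b", "c", "d"], square, 2)

# Three mutually adjacent states cannot be embedded, however long the code,
# because the hypercube contains no odd cycles.
triangle = [("a", "b"), ("b", "c"), ("c", "a")]
triangle_2 = find_embedding(["a", "b", "c"], triangle, 2)
triangle_3 = find_embedding(["a", "b", "c"], triangle, 3)
```

The triangle case is a small instance of the remark that adding code bits
does not by itself guarantee that all adjacency requirements can be met.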

Our approach exploits the use of don't-care (dc) conditions in state codes. In
particular every state is coded by associating each vertex of the adjacency
graph to a subcube of the boolean hypercube. This is equivalent to embedding
the adjacency graph into a squashed hypercube, i.e., a hypercube having
appropriate faces squeezed into vertices. Note that most of the state
assignment techniques presented in the literature obtained a state coding using
the minimum number of bits, because it was important to minimize the number of
memory elements due to their cost. On the other hand, the area taken by the PLA
is the major concern in a VLSI circuit implementation of a Finite State
Machine. Minimal area PLA implementations of the FSM combinational component
can be obtained by using non-minimal-length state codings, i.e., fewer
product-terms are often required to implement a logic function at the expense
of an increased number of input/output columns. Therefore we allow
non-minimal-length state codings when they lead to minimal area PLAs. In this
case, state coding corresponds to an embedding into a squashed hypercube of
variable dimension. However, bounds on code length can be enforced when
required by a particular implementation.
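
Two points in this paragraph can be illustrated numerically. The sketch below
makes several assumptions that are not stated in this report: codes are
written as strings over 0, 1, and - (don't care); the distance between two
codes counts only the positions where one has 0 and the other 1; and PLA area
is modeled as rows times columns, with two columns per input and one per
output. The product-term counts in the example are invented.

```python
def squashed_distance(code_a, code_b):
    """Positions where one code has 0 and the other 1; '-' never counts."""
    return sum(1 for a, b in zip(code_a, code_b) if {a, b} == {"0", "1"})


def vertices(code):
    """Expand a code with don't-cares into the hypercube vertices it covers."""
    result = [""]
    for ch in code:
        choices = "01" if ch == "-" else ch
        result = [prefix + c for prefix in result for c in choices]
    return result


def pla_area(inputs, outputs, state_bits, product_terms):
    """Assumed model: (2 columns per input + 1 per output) x product terms."""
    columns = 2 * (inputs + state_bits) + (state_bits + outputs)
    return columns * product_terms


# Three mutually adjacent states: state a is coded as a face (two vertices).
codes = {"a": "0-", "b": "10", "c": "11"}
pairs = [("a", "b"), ("b", "c"), ("c", "a")]
distances = [squashed_distance(codes[u], codes[v]) for u, v in pairs]

# Hypothetical machine: a longer code that saves product terms can win on area.
short_code_area = pla_area(inputs=2, outputs=1, state_bits=3, product_terms=10)
long_code_area = pla_area(inputs=2, outputs=1, state_bits=4, product_terms=7)
```

Under these assumptions all three pairs are at distance 1, which no
assignment of single hypercube vertices can achieve for three mutually
adjacent states. In the area example the 4-bit coding is smaller than the
3-bit one despite its extra columns.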

Experimental results obtained using this approach on a number of
industrial machines have been satisfactory. However, very recently a new
technique based on multiple-valued logic minimization and on the ideas
described above has been obtained in collaboration with Dr. R. Brayton of IBM.
This technique has been very successful in coding Finite State Machines
efficiently, improving over manual techniques and over the technique presented
above by a sizable amount. We are presently studying its implications and
developing new embedding algorithms.


(1) G. De Micheli, A. Sangiovanni-Vincentelli, T. Villa, "Computer-aided Synthesis
of Finite-State Machines," Proc. of Int. Conf. on CAD 1983, Santa Clara, CA,
Sept. 1983.
2.8. Relaxation-Based Circuit Simulation (R. Newton, A. Sangiovanni-Vincentelli)


Recently, a new class of algorithms has been applied to the electrical IC
simulation problem. New simulators have been developed at Berkeley (RELAX
and RELAX2, SPLICE1.8 and SPLICE2) that use these methods and provide
waveforms as accurate as, or more accurate than, those of standard circuit
simulators such as SPICE2 or ASTAP, with up to two orders of magnitude speed
improvement for large circuits. These simulators have been used for the
analysis of both digital and analog MOS ICs. They use relaxation methods for
the solution of the set of ordinary differential equations (ODEs) which
describe the circuit under analysis, rather than the direct, sparse-matrix
methods on which standard circuit simulators are based.

During this period, we studied the numerical properties of the various
relaxation methods for the analysis of MOS circuits and presented them in a
rigorous and unified framework in [1]. We also improved our relaxation
algorithms and their implementation in [2].

"Relaxation-Based Electrical Simulation" [1] has been written to provide a
complete picture of the new methods for circuit analysis and we expect it to become
the standard reference for new work on circuit simulation. Both the advantages
and the limitations of these techniques for the analysis of large ICs are
described. Some of the fundamental problems associated with conventional
circuit simulation algorithms as circuit size increases are exposed and the
mathematical basis for the relaxation approach is introduced. The special
relaxation methods called timing simulation algorithms are described and their
numerical properties are investigated. Iterated Timing Analysis, which applies
relaxation techniques at the nonlinear equation level, is described and its
convergence properties are proven. The Waveform Relaxation method, which
applies relaxation techniques at the differential equation level, is presented and
various techniques which can be used to improve its performance for electrical
simulation are described. Future research directions, including the use of
special purpose hardware for the implementation of these algorithms, are
presented.

In [2], we described RELAX2, a new improved version of RELAX, a Waveform-
Relaxation based simulator. In particular, we were able to characterize the
convergence behaviour of the Waveform Relaxation Method on a class of circuits
which required a large number of iterations to converge. The study of the
convergence behaviour has led to the concept of "windowing", i.e., of breaking up
the time interval over which analysis has to be performed into sub-intervals so
that the algorithm applied in these sub-intervals exhibits fast convergence. In
addition, techniques for the adaptive error control of Waveform Relaxation have
been tested and improved.
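
The idea of Waveform Relaxation with windowing can be illustrated on a small
linear system x' = A x. The following is only a minimal sketch under stated
assumptions, not the RELAX2 implementation: the function name, the Gauss-Seidel
sweep order, the backward Euler discretization, the constant initial waveform
guess and the fixed equal-size windows are all choices made for illustration.

```python
def waveform_relaxation(a, x0, t_end, steps, windows=1, tol=1e-9, max_iter=200):
    """Gauss-Seidel waveform relaxation for x' = a x (illustrative sketch).

    Each unknown is integrated over a whole window with backward Euler while
    the other unknowns are held at their most recent waveforms. Returns the
    list of states at every time point and the total number of sweeps.
    """
    if steps % windows != 0:
        raise ValueError("steps must be divisible by windows")
    n = len(x0)
    h = t_end / steps
    per_window = steps // windows
    trajectory = [list(x0)]
    sweeps = 0
    for _ in range(windows):
        start = trajectory[-1]
        # Initial guess: every waveform is constant over the window.
        wave = [[start[i]] * (per_window + 1) for i in range(n)]
        for _ in range(max_iter):
            change = 0.0
            for i in range(n):
                new = [start[i]]
                for k in range(1, per_window + 1):
                    coupling = sum(a[i][j] * wave[j][k] for j in range(n) if j != i)
                    # Backward Euler: x_k = x_{k-1} + h * (a_ii * x_k + coupling)
                    new.append((new[k - 1] + h * coupling) / (1.0 - h * a[i][i]))
                change = max(change, max(abs(p - q) for p, q in zip(new, wave[i])))
                wave[i] = new  # Gauss-Seidel: later unknowns see the update
            sweeps += 1
            if change < tol:
                break
        for k in range(1, per_window + 1):
            trajectory.append([wave[i][k] for i in range(n)])
    return trajectory, sweeps
```

At convergence the result coincides with a direct backward Euler solution of the
coupled system; shorter windows trade more restarts for fewer sweeps per window.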


(1) R. Newton and A. Sangiovanni-Vincentelli, "Relaxation-Based Electrical
Simulation", IEEE Trans. on Elec. Dev., to appear; SIAM Journ. on
Scientific and Statistical Computing, to appear; IEEE Trans. on CAD of IC
and Syst., to appear.

(2) J. White and A. Sangiovanni-Vincentelli, "RELAX2: A New Waveform-relaxation
Approach for the Analysis of MOS LSI Circuits", Proc. of Int. Symp. on Circ.
and Syst., Newport Beach, Ca., May 1983, invited paper.


2.6. Special Purpose Architectures for the Solution of Large Scale Systems (R.
Newton, A. Sangiovanni-Vincentelli)


The solution of Large-scale linear Systems of algebraic Equations (LSE) is
needed in the analysis and simulation of many engineering systems.

New architectures, in particular vector computers such as the CRAY 1, have
inspired the design of new algorithms to exploit parallelism in the solution
process. An important example is the program CLASSIE for the simulation of
electronic circuits. Along these lines, peripheral array processors, such as the
FPS164, can also be used in conjunction with hosts such as the VAX 11/780 to
speed up the solution process. However, this speedup is not enough to cope with
the problems to be solved in the VLSI era.

The advent of VLSI technology has made the cost-effective design of special
purpose machines possible. Examples of these machines are the Yorktown
Simulation Engine (YSE) for logic simulation and Systolic Arrays. Special purpose
machines have also been proposed for the solution of LSE. Most of these
machines limit the size of the operand matrix. When no size limit is imposed, the
operand matrix has to be partitioned into submatrices of equal sizes. Only
Johnsson and Pottle treated the related numerical properties. However, special
matrix structures, such as the Bordered Block Diagonal Form (BBDF) or the
Bordered Block Triangular Form (BBTF), commonly expected in engineering
problems, have not been exploited. In [1], we proposed a new algorithm-architecture,
BLOSSOM, for the solution of LSE.

This architecture supports other matrix operations used as subprocedures
by block LU decomposition, such as the multiplication and the inversion of
submatrices. We described the hardware implementation of these matrix
operations.
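
To show why the Bordered Block Diagonal Form lends itself to parallel solution,
the sketch below performs block elimination on a BBDF system in plain Python.
It is an illustration only and does not reproduce BLOSSOM: the function names,
the argument layout (lists of diagonal, right-border and bottom-border blocks
plus a corner block) and the dense Gaussian elimination used for each block are
assumptions. The loop over diagonal blocks has no dependence between
iterations, which is the part a parallel machine could execute concurrently.

```python
def solve(m, rhs):
    """Solve m X = rhs (rhs given as a matrix) by Gaussian elimination."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, len(aug[r])):
                aug[r][k] -= f * aug[c][k]
    width = len(aug[0]) - n
    x = [[0.0] * width for _ in range(n)]
    for r in reversed(range(n)):
        for j in range(width):
            s = aug[r][n + j] - sum(aug[r][k] * x[k][j] for k in range(r + 1, n))
            x[r][j] = s / aug[r][r]
    return x


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def solve_bbdf(diag, right, bottom, corner, rhs_blocks, rhs_border):
    """Solve a bordered block diagonal system by block elimination."""
    nb = len(corner)
    schur = [row[:] for row in corner]
    g = list(rhs_border)
    partial = []
    # Each diagonal block is processed independently of the others.
    for a, b, c, f in zip(diag, right, bottom, rhs_blocks):
        sol = solve(a, [brow + [fi] for brow, fi in zip(b, f)])
        y = [row[:-1] for row in sol]   # a^-1 b
        z = [row[-1] for row in sol]    # a^-1 f
        cy = matmul(c, y)
        cz = matmul(c, [[v] for v in z])
        for i in range(nb):
            g[i] -= cz[i][0]
            for j in range(nb):
                schur[i][j] -= cy[i][j]
        partial.append((y, z))
    # Only the small border (Schur complement) system couples the blocks.
    xb = [row[0] for row in solve(schur, [[v] for v in g])]
    xs = [[z[i] - sum(y[i][j] * xb[j] for j in range(nb)) for i in range(len(z))]
          for y, z in partial]
    return xs, xb
```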


(1) H. Ko and A. Sangiovanni-Vincentelli, "BLOSSOM: an Algorithm and
Architecture for the Solution of Large-Scale Linear Systems", Proceedings of the
International Conference on Computer Design, New York, Oct. 1983.


2.7. Incremental Wire Manipulation for VLSI Layout (C. Sequin)

Existing wire-based layout manipulation systems have tended to do operations
such as compaction on a global basis, compacting an entire layout or
module at a time. WICRD (Wire Incremental Compaction, Routing, and
Displacement) is an experimental prototype system to explore algorithms for efficient
incremental changes to an existing layout, while still allowing these global
operations to proceed efficiently. The WICRD user model envisions a user sitting at a
graphics terminal, making many small incremental changes to the layout and
only occasionally invoking more global operations. A typical WICRD operation
would be moving a single wire segment a short distance, changing the position of
as few other objects as possible while maintaining the layout topology and
connectivity. One of the conclusions demonstrated by WICRD is that global
compaction may be viewed as a special case of incremental displacement, using the
same conceptual framework and algorithms.

One of the issues investigated in WICRD is that of wire representation.
Three classes of wire representations were investigated: connected skeletons,
fleshed-out geometry, and directed finite-width wire segments. Skeletal
representations model a wire as a chain of connected line segments, typically
chosen to be the centerline of the wire, with attached attributes such as width
and allowable separation from other wires. They correspond closely to the
abstract semantics of wires, but have disadvantages when used in conjunction
with circuit blocks of finite size because of their lack of explicit geometrical
information.

Fleshed-out geometry models explicitly represent the physical space occupied
by the wire material. They are a convenient representation for layout rule
checking, but make it harder to keep track of connectivity and other semantic
properties of the wiring pattern. Directed wire segments attempt to combine
the best features of skeletal and fleshed representations by starting with a
fleshed representation and adding explicit attributes to indicate the underlying
wire structure. WICRD implements this directed-segment representation.
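
A directed finite-width segment can be pictured as a rectangle that also
carries the information a skeleton would hold. The sketch below is a
hypothetical data structure, not WICRD's actual one: the class name, every
field name, and the restriction to axis-aligned segments are assumptions made
for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DirectedSegment:
    # Fleshed-out geometry: the rectangle actually occupied by wire material.
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    # Added attributes that expose the underlying wire structure.
    horizontal: bool                      # direction in which the wire runs
    net: str                              # electrical net the segment belongs to
    next_segment: Optional["DirectedSegment"] = None  # connectivity link

    def width(self) -> float:
        """Extent perpendicular to the direction of the wire."""
        if self.horizontal:
            return self.y_max - self.y_min
        return self.x_max - self.x_min

    def centerline(self) -> Tuple[Tuple[float, float], Tuple[float, float]]:
        """Recover the skeletal view (centerline end points) from the rectangle."""
        if self.horizontal:
            y = (self.y_min + self.y_max) / 2
            return (self.x_min, y), (self.x_max, y)
        x = (self.x_min + self.x_max) / 2
        return (x, self.y_min), (x, self.y_max)
```

With such a record the rectangle serves layout rule checking directly, while
the direction, net and link fields preserve connectivity.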

The various representations also lead to interesting trade-offs in the areas
of automatic routing and compaction and in the user interface.


(1) C.H. Sequin and R.M. Fujimoto, "X-tree and Y-components", in VLSI
Architecture, B. Randell and P.C. Treleaven, editors, Prentice Hall, 1983, pp 299-
328.

(2) R.M. Fujimoto, "VLSI Communication Components for Multicomputer
Networks", PhD Thesis, U.C. Berkeley, Aug. 25, 1983.


Circuitry contained on one Multibus card will be able to recognize 1000
words in real time. On the card are two custom ICs, an Intel 80186
microcomputer and memory. This card is in the final stages of construction.

One of the ICs is a filterbank chip which was generated fully automatically
from software. This chip implements a 16 channel filterbank with 112 poles of
filtering, at an initial sample rate of 14 kHz.

The second chip is an enhanced version of a previous design, which performs a
dynamic programming algorithm. The new chip has more parallelism as well as a
number of glue logic functions for memory control which were required to be
able to fit the entire recognition system on one board.

The 80186 performs the Multibus interface and will be used in future research
to implement such things as syntax direction, continuous speech algorithms and
sophisticated training (learning) algorithms.

We are putting this card into a SUN workstation, and plan to use it to
incorporate speech into a number of applications. The following two sections describe
the two special purpose chips in more detail.


3.2. Dynamic Programming Integrated Circuit (R.W. Brodersen)

An integrated circuit capable of performing the pattern matching required
to recognize 1000 words of speech in real time has been designed and
fabricated. The chip uses a parallel and pipelined architecture to compute the
Euclidean distance between two 16-dimensional vectors (4 bits per dimension)
while simultaneously executing a dynamic programming algorithm. The chip
requires only a minimum amount of external circuitry: a clock, memory, and
address latches.

The boundary conditions of the dynamic programming matrix can be
controlled, so that the chip is compatible with both isolated and connected-word
algorithms without additional hardware. No pruning or global slope constraints
are used, thus each frame of the unknown word must be compared to each
frame in the template word memory. Local slope constraints can be applied as a
programmable option. For a one thousand word vocabulary, with 25 frames per
word, at a 20 ms frame rate, 25000 frames must be stored in the template
memory. The processing of a single frame requires that a 16-dimensional
Euclidean distance be computed, as well as a minimization. The chip can
perform these computations in 800 ns, thus all template frames can be processed in
20 ms. The chip, therefore, not only runs in real time, but can process all
unknown word frames before the endpoint is detected.
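
The matching described above, in which every frame of the unknown word is
compared with every template frame and a minimization follows each distance
computation, can be sketched in software as follows. This is an illustrative
dynamic time warping routine, not the chip's algorithm: the function names and
the particular local path rule (predecessors from the left, below and diagonal,
with no slope constraints) are assumptions.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_cost(unknown, template):
    """Accumulated distance of the best warping path, with no pruning."""
    inf = float("inf")
    prev = [inf] * len(template)
    for i, u in enumerate(unknown):
        cur = [inf] * len(template)
        for j, t in enumerate(template):
            if i == 0 and j == 0:
                best = 0.0
            else:
                left = cur[j - 1] if j > 0 else inf
                diag = prev[j - 1] if j > 0 else inf
                best = min(prev[j], left, diag)  # the minimization step
            cur[j] = frame_distance(u, t) + best
        prev = cur
    return prev[-1]


def recognize(unknown, vocabulary):
    """Return the word whose template has the lowest accumulated distance."""
    return min(vocabulary, key=lambda word: dtw_cost(unknown, vocabulary[word]))
```

A usage example would pass a list of 16-element frames as `unknown` and a
dictionary mapping each word to its list of template frames.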

The  chip  has  the  following  functional  units,  all  running  in  parallel: 

(1) A distance processor that can compute a 4-dimensional Euclidean distance
every clock cycle.

(2) A pipeline accumulator that sums four 4-dimensional Euclidean distances
into one 16-dimensional distance.

(3) A dynamic programming processor that can compute one minimization
every 4 clock cycles.

(4)  An  addressing  unit  for  the  external  template  and  scratch-pad  memories. 

(5)  A  controller  for  each  of  the  above  processors. 

The chip is implemented in a 4 micron NMOS process, has an active area of
20,000 square mils, and runs with a 5 MHz clock.
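
The figures quoted in this section are mutually consistent, as the short
calculation below shows. All numbers come from the text; the variable names
are chosen for illustration, and the assumption that the distance and
minimization units overlap fully in the pipeline is an interpretation.

```python
CLOCK_HZ = 5_000_000          # 5 MHz clock
DIMENSIONS = 16               # feature vector size
DIMS_PER_CYCLE = 4            # distance processor: 4 dimensions per cycle
CYCLES_PER_MINIMIZATION = 4   # dynamic programming processor
VOCABULARY = 1000             # words
FRAMES_PER_WORD = 25

cycle_ns = 1_000_000_000 // CLOCK_HZ                  # 200 ns per cycle
distance_cycles = DIMENSIONS // DIMS_PER_CYCLE        # 4 cycles per distance
# The units run in parallel, so the slower one sets the time per frame.
frame_ns = max(distance_cycles, CYCLES_PER_MINIMIZATION) * cycle_ns
template_frames = VOCABULARY * FRAMES_PER_WORD
pass_ms = template_frames * frame_ns / 1_000_000      # one pass over all templates
```

The result is 800 ns per template frame and 20 ms for all 25000 template
frames, which matches the 20 ms frame rate.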

3.2.1. Computer Generated Digital Filter Banks

The goal of this project is to allow the system designer to generate digital
filter bank chips without performing any circuit design or layout. The circuit is
entirely computer generated and requires about 5 to 10 minutes after the filter
coefficients and filter organizations have been determined.

The project has been broken into two parts. First, a pipelined digital
processor has been designed with a major focus on speed and small size. Also
important is modularity, so that the processors can be easily assembled in
different configurations with a minimum of signal routing. Second, programs
have been developed to turn system information (filter types, coefficients,
number of processor bits) into completed circuit layouts, ready for fabrication.
To enhance flexibility, two separate programs are employed. One generates the
micro-code for the processors from the filter descriptions, while the other
generates the chip layout from the micro-code.

A complete 16 channel filter bank (each channel has a 4th order bandpass
filter, rectifier and a 3rd order lowpass filter) operating at a 14 kHz sample
rate has been designed and fabricated. Test results indicate that frequency
response and dynamic range are identical to those expected. In addition, the
circuits operated nearly 50% faster than required for the current application,
indicating that more complex systems are possible. The program for generating
layouts from micro-code has been completed and chips generated by this program
have been fabricated. The program for generating micro-code from filter
descriptions is in development. The system now has the capability to use one or
two processors and an extension to allow 4 processor circuits is being added.
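
The signal path of one filter bank channel (bandpass filter, rectifier, lowpass
filter) can be sketched in floating point as below. This is not the generated
chip's arithmetic or its coefficients: the two-pole resonator sections, the
pole radius 0.95, the one-pole smoothing sections with coefficient 0.99 and all
function names are assumptions picked only to give a 4th order bandpass and a
3rd order lowpass.

```python
import math


def resonator(samples, freq_hz, fs_hz, r=0.95):
    """Two-pole bandpass section:
    y[n] = x[n] - x[n-2] + 2 r cos(w) y[n-1] - r^2 y[n-2]."""
    w = 2 * math.pi * freq_hz / fs_hz
    a1, a2 = 2 * r * math.cos(w), -r * r
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = x - x2 + a1 * y1 + a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def one_pole_lowpass(samples, a=0.99):
    """One-pole smoothing section: y[n] = (1 - a) x[n] + a y[n-1]."""
    y = 0.0
    out = []
    for x in samples:
        y = (1 - a) * x + a * y
        out.append(y)
    return out


def channel(samples, center_hz, fs_hz=14000.0):
    """4th order bandpass, full-wave rectifier, 3rd order lowpass."""
    band = resonator(resonator(samples, center_hz, fs_hz), center_hz, fs_hz)
    envelope = [abs(v) for v in band]
    for _ in range(3):
        envelope = one_pole_lowpass(envelope)
    return envelope


def tone(freq_hz, n, fs_hz=14000.0):
    return [math.sin(2 * math.pi * freq_hz * k / fs_hz) for k in range(n)]


POLES_PER_CHANNEL = 4 + 3
TOTAL_POLES = 16 * POLES_PER_CHANNEL   # 112 poles for 16 channels
```

Feeding a tone at the channel's center frequency yields a much larger envelope
than a tone far from it, which is the behaviour a recognizer front end relies on.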


(1) M. Lowy, H. Murveit, R. W. Brodersen, "A Large Vocabulary Speech
Recognition System", Proc. of VLSI in Modern Signal Processing, USC.

(2) T. C. Choi, R.T. Kaneshiro, P.R. Gray, W. Jett, M. Wilcox, R. W. Brodersen,
"High Frequency CMOS Switched-Capacitor Filters for Communications
Applications", 1983 IEEE Intl. Solid State Circuits Conf. Tech. Digest, pp.
246-297.

(3) R.D. Fellman, R. W. Brodersen, "A Switched Capacitor Adaptive Lattice
Filter", Journal of Solid State Circuits, February 1983, pp. 46-56.

(4) T. Choi, P. R. Gray, R. Kaneshiro, R. W. Brodersen, "Circuits and Systems

Paper.

(5) H. Murveit, M. Lowy, D. Mintz, R. W. Brodersen, "An Architecture for a
Speech Recognition System", Tech. Digest of ISSCC, February 1983,
pp. 118-119.

(6) R.D. Fellman, R.J. Hurst, R. W. Brodersen, "Switched-Capacitor Circuits for
Adaptive Filtering and Autocorrelation", Tech. Digest of ISSCC, February
1983, pp. 128-127.
3.3. Rapid Design Methods for Signal Processing Circuits (R.W. Brodersen)

A rapid design system is being implemented for monolithic implementation
of signal-processing systems. This interdisciplinary effort spans the fields of
circuit design, CAD software development, computer architecture and digital
systems design.

The circuit-design aspects of this project involve development of macrocell
libraries in both NMOS and CMOS technologies. These libraries allow signal-
processing circuits to be assembled rapidly, using proven architectural
approaches to signal-processing problems. To a large extent, they are upgraded
versions of cells already used successfully for implementing several signal-
processing functions.

The CAD aspects of the project involve developing software that allows
behavioral simulation and automatic layout generation of the target circuit
functions based on a high-level description provided by the designer. The
software system will allow signal-processing researchers to implement special-
purpose circuits rapidly and efficiently, even if their background in circuit and
system design is rather modest. The silicon area utilization achieved by this
system is far more efficient than that of current and proposed "silicon compilers."

Target applications envisioned include filters, speech recognition, low-bit-
rate speech coding, image processing, modems, line equalizers, and other
telecommunications functions.


(1) R.W. Brodersen, S. Pope, "Macrocell Design for Concurrent Signal
Processing", Computer Science Press, pp. 395-412.

(2) R. W. Brodersen, "Implications of VLSI Technology for Speech Processing",
Abstract in Proc. of COMPCON, Los Angeles, CA, February 1983, pp. 845.

(3) Sue Kellers, "A Preprocessor for a Speech Recognition System"

(4) Alan Yuen, "Design of a Digital Spectra Analyzer"

(5) David Mintz, "An Implementation of a Speech Recognition System"

(6) Eric Davies, "Endpoint Detection of Speech for Real Time Isolated Word
Recognition"

(7) Ramin Esrnail Beygln, "An LSI 7 Pole Lowpass Switched-Capacitor Filter"

(8) Robert Keveler, "Clustering Analysis Applied to Speech Recognition"

(9) Tat Choi, "High Frequency Switched-Capacitor Filters"

(10) Ron Fellman, "An MOS-LSI Adaptive Linear Prediction Filter for Speech
Processing"

(11) Menaham Lowey, "Design Considerations of a Speech Recognition System
Using Special Purpose Integrated Circuits"

(12) Hy Murveit, "A 1000 Word Real Time Speech Recognition System"
3.4. MOS Analog Circuits (D. Hodges)


3.4.1.


This project was aimed at improving the speed and linearity performance of
fast digital-to-analog converters (DACs) for use in video and vector graphic
display systems. Existing DAC designs typically employ R-2R resistive ladders
and require trimming to eliminate differential nonlinearity. Also, they often
exhibit harmful "glitches" (transient output excursions) at major transition
points due to time delay skews among decoder paths.
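
The glitch mechanism in a conventional binary-weighted DAC can be made concrete
with a small sketch. It models only the problem described above, not the new
technique of [1]; the function name and the simplification that the input bits
fall into just two groups (early and late) are assumptions.

```python
def transient_codes(old_code, new_code, early_bits):
    """Codes seen by a binary-weighted DAC when some bits switch early.

    early_bits lists the bit positions whose decoder paths are faster;
    the remaining bits still hold their old values for a short time.
    """
    mask = sum(1 << b for b in early_bits)
    intermediate = (old_code & ~mask) | (new_code & mask)
    return [old_code, intermediate, new_code]


# Major transition of an 8-bit converter: 0111 1111 -> 1000 0000.
msb_early = transient_codes(127, 128, early_bits=[7])
msb_late = transient_codes(127, 128, early_bits=[0, 1, 2, 3, 4, 5, 6])
```

If the most significant bit arrives first the output briefly jumps to full
scale (code 255); if it arrives last the output briefly drops to zero, although
the intended change is a single step.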

A new MOS LSI DAC technique has been developed and demonstrated to
achieve excellent differential linearity and freedom from glitches [1]. An 8-bit
experimental DAC was fabricated in our laboratory using 4 micron silicon gate
NMOS technology to demonstrate the linearity and glitch immunity of the new
technique. Overall die size is 2.5 mm by 3 mm. Output settling to 1% in less
than 60 ns was observed. Glitch-free response is observed even with input skew
greater than 20 ns. A doctoral dissertation on this work is in preparation.

Recent simulation studies have shown that through use of an advanced 3
micron CMOS process ("Berkeley CMOS") the settling time can be reduced to 40
ns [2]. Estimated die size with this process is 1.7 mm by 2.0 mm. Extension to
10 or even 12 bits resolution without sacrifice of differential linearity or glitch-
free performance can be contemplated by using two levels of cascoding to
combine output currents.

3.4.2. On-Board Frequency Reference


Analysis and simulation have been carried out for an on-chip CMOS active
sinusoidal oscillator requiring no external components and having a frequency of
oscillation which may be set in the range 1 MHz to 5 MHz by a single digital
trimming operation [3]. The intended application of this oscillator is to provide a
system clock for VLSI computing or communications components, eliminating the
need for external RC or LC elements. Frequency stability comparable to crystal
oscillators cannot be expected.

The  proposed  oscillator  is  composed  of  two  CMOS  operational  amplifier 
integrators  connected  in  a  closed  loop.  Frequency  trimming  would  be  by  means 
of  a  digitally-adjusted  bias  current  source.  Oscillation  is  initiated  by  normal 
thermal  noise  and  reaches  steady-state  within  10-50  microseconds.  Compared 
to  relaxation  oscillators,  this  sinusoidal  oscillator  may  exhibit  smaller  timing 
jitter  because  the  period  of  oscillation  is  determined  by  integration. 
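
The behaviour of two integrators connected in a closed loop can be checked
numerically. The sketch below is an idealized model, not the circuit of [3]: it
assumes lossless integrators with equal unity-gain frequency, starts from a
nonzero state instead of thermal noise, omits any amplitude control, and uses
names and a semi-implicit Euler step chosen for illustration.

```python
import math


def simulate_loop(f0_hz, dt_s, steps):
    """Integrate x' = w*y, y' = -w*x and record rising zero crossings of x."""
    w = 2 * math.pi * f0_hz   # unity-gain frequency of each integrator
    x, y = 1.0, 0.0
    crossings = []
    for n in range(1, steps + 1):
        previous = x
        x += w * dt_s * y     # first integrator
        y -= w * dt_s * x     # second integrator, with sign inversion
        if previous < 0.0 <= x:
            crossings.append(n * dt_s)
    return crossings


def estimated_frequency(crossings):
    """Average frequency between the first and last rising zero crossing."""
    return (len(crossings) - 1) / (crossings[-1] - crossings[0])
```

In this model the period depends only on the integrator time constant, so
trimming that constant (for example through the bias current) moves the
frequency of oscillation.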

Simulation studies show that a 5 MHz oscillator with 10-15% harmonic
distortion can be achieved with about 5 mW of power in a 3 micron CMOS process.
Assuming that a suitable temperature-compensated bias current source can be
designed, frequency stability better than ±20% over temperature and supply
voltage variations should be achievable.


(1) V. W-K Shen and D. A. Hodges, "A 60 ns Glitch-Free NMOS DAC," ISSCC Digest
of Tech. Papers, 1983, pp. 188-189.

(2) K. Hirota, "CMOS 40 ns 8 bit Digital to Analog Converter," M.S. Plan II Report,
University of California, Berkeley, Aug. 1983.

(3) F-R Leu, "CMOS Sinusoidal Oscillator," M.S. Plan II Report, University of
California, Berkeley, Aug. 1983.
4.1.1. Thin Oxide Studies

In April, a paper on the theory of oxide tunneling conduction [1] was
presented by a student at the International Conference on Insulating Films on
Silicon in Holland. The conference paid for his travel expenses. Another paper,
on thin oxide device degradation, was presented and published in the proceedings
of the Electrochemical Society Meeting in May [2]. In June, at the Device Research
Conference, we presented a paper on oxide charge transport [3]. Some novel
phenomena of deep depletion regions being created or destroyed by charge
tunneling through thin oxides appeared in the August issue of Electron Device
Letters [10]. A comprehensive paper on our study of 100 Å gate MOSFET
degradation was submitted [11]. Two abstracts have been submitted to the
1983 International Electron Devices Meeting, one dealing with impact ionization
by tunneling electrons [14] and the other dealing with a comparison of device
degradations induced by hot electrons and by tunneling electrons [16]. Finally,
Professor C. Hu will present an invited paper on this series of studies at the 1983
IEEE Interface Specialist Conference in December 1983, in Florida.

In the next 6 months, emphasis is expected to shift to the phenomenon of
time-dependent dielectric breakdown, a leading reliability concern and
challenging puzzle. Work will also begin on new dielectrics such as oxide-nitride
combinations.


4.1.2. Hot Electron Studies

The pace of fruitful research in this area continued. A paper on MOSFET
characteristics near and beyond breakdown appeared in IEEE Trans. Electron
Devices in June [4]. We published the first report on hot-electron currents
in a 0.14 µm channel MOSFET (and our model worked very well for that channel
length) in June [5]. The study on punchthrough voltage appeared in September
[6]. Another find was the successful photographing of light emission from Si
MOSFETs and filamentary conduction under certain bias conditions [9]. A
complete report on channel hot-electron injection was submitted [12]. Light
emission was also observed from forward-biased PN junctions [16]. The data and
theory of bremsstrahlung radiation from Si MOSFETs have been submitted for
publication [17]. Finally, Professor C. Hu will give an invited paper on this research
at the 1983 International Electron Devices Meeting. A spin-off of this research is
a model of the feedthrough voltage in switched-capacitor circuits [8, 13].

One area of future work will emphasize new structures that minimize hot-
electron effects. Another project is concerned with obtaining a detailed picture
of the effects of hot-carrier transfers on the valency and density of interface
states. We are studying a modified charge-pumping technique that promises to
permit the determination of several important interface-state properties.
(1) C. Chong, C. Hu, R.W. Brodersen, "Direct and Fowler-Nordheim Tunneling in
Thin Gate Oxide MOS Structures", International Conference on Insulating
Films on Silicon (INFOS), April 1983. Proceedings to be published by North
Holland Publishing Company.

pp. 609-610.

IEEE Transactions on Electron Devices, September 1983.

mitted to IEEE Journal of Solid State Circuits.

(14) C. Hu, T. Ong, S. Tam, K. Terrill, "Light Emission from Forward-Biased PN
4.2. Berkeley Advanced CMOS (A. Neureuther)

The fabrication and testing of the first set of devices with the Advanced
Berkeley CMOS process has been completed. The p-channel device threshold
was not sufficiently negative to turn the devices off. The n-channel devices
worked as expected. The p-channel threshold problem was traced to a
malfunction of the POCl3 doping, which had to be repeated after a cleaning etch. This
etch reduced the poly thickness to the point that the source/drain implant was
not adequately masked. This problem has been corrected on two runs in
progress. Part of our present effort is to bring up this process on 4" wafers in our
new microelectronics laboratory, which is more suited to exploring design rule
scaling.


4.3. Simulation Aids for Viewing Wafer Topography from Layout (A.
Neureuther)


The object of this project is to develop an IC design aid, SIMPL (SIMulated
Profiles from Layout), which gives the cross sectional view of the wafer
topography and doping profile associated with the layout on a graphics editor.
This simulator will complete the computer aided IC design sequence by linking
the layout program to the topography simulator, SAMPLE, and its associated
post processors for determining electrical parameters.

The first level simulator, SIMPL1, is now running in "C". Results for CMOS
and Bipolar processes are described in the attached abstract which has been
accepted for presentation at IEDM in December. SIMPL1 quickly calculates the
approximate cross sectional view of the wafer at any step in the fabrication
process. In order to make the simulation fast, SIMPL1 approximates all features
as rectangles and has simple models of device physics. The simulations provide
visual verification of the layout and of the effects of each process step. The
complete simulation of a CMOS inverter runs in 5 sec. on a VAX 11-780/UNIX.

The second level simulator, SIMPL2, will employ more process parameters
and give a more realistic cross sectional view using the string model for
profiles. The optional direct call of SAMPLE from SIMPL2 will be implemented
for accurate topography simulation. The conceptual design and implementation
of SIMPL2 are now under study.


4.4. Aluminum/Silicon Contact Electromigration and Contact Resistivity
Measurement (W.G. Oldham)


The failure of Al/Si contacts under current stress is being investigated. As
device geometries are scaled down, the density of current flow through conductors
increases, which aggravates electromigration. Two types of failure have been
observed: contact leakage/spiking, which is due to high silicon solubility in
aluminum and to electromigration, and aluminum line opening, which is due to
aluminum electromigration. Other refractory metal/silicide to silicon contacts
are also being considered.

Special attention has been paid to designing the testing apparatus in such a
way that chip temperature rise due to local heating is eliminated and constant
temperature throughout the chip is assured. An IBM personal computer is
used as the controller to perform automatic stressing, resistance and leakage
current measurement, data analysis, etc.

The location of the contact failure site depends on the contact current
distribution which, in turn, depends on contact resistivity and geometry. A study
of contact resistivity measurement is therefore also being conducted. A model using
conformal transformation is adapted to study current distribution along and below
the contact. A novel technique which would yield uniform contact resistivity while
preserving electromigration resistance is also being considered.


(1) R. Kazerounian, W. G. Oldham, "Threshold Shift from As-Implantation of
SiO2", Proceedings of the Electrochemical Society Meeting, May 1983, San
Francisco.

(2) R. Kazerounian, W. G. Oldham, "CODMOS - A Depletion Device Using Fixed
Oxide Charge", presented at Device Research Conference, June 1983,
Burlington.


4.5. Modification of Metal Contacts with Ion Beams (N. Cheung)


The  objective  of  this  project  is  to  investigate  the  effects  of  ion  implantation 
on  the  electrical  and  metallurgical  properties  of  metal-silicon  contacts.  Both 
silicide  forming  and  non-silicide  forming  systems  are  being  studied. 

Arsenic ions have been implanted through thin metal films to doses on the
order of 10^14 to 10^16 cm^-2. The high dose implant is used to simultaneously
intermix the metal-Si interface and heavily dope the interface silicon. In the
simple Al-Si eutectic system, we have observed that interface mixing enhances
the uniform dissolution of Si into Al, which minimizes the Al pitting into shallow
Si junctions due to subsequent annealing. In silicide forming systems,
ion beam mixing has been shown to promote silicide formation at reduced
temperatures. The sheet resistance of the Si substrate decreases due to the
peaking of implanted dopants at the metal-Si interface. However, the implantation
increases the contact resistance by a factor of five.

The effects of ion beam mixing on contact electrical properties are
determined using a test vehicle that contains patterns for measuring the contact
resistance and Schottky barrier height of contacts ranging from 20 to 1.5
microns square. Metallurgical reactions are monitored using Rutherford
Backscattering Spectrometry and SEM.


